Group 25


- Harsh Maheshkumar Pandya
- Rushank Ghanshyam Sheta
- Shubhkirti Prasad

Problem Statement :

The main goal of our EDA is to know and understand the distribution of Video Game sales data over the years in different continents. Through the years there have been many Console releases; Sony PlayStation, Nintendo Wii and Xbox 360 being some of the most famous ones.


About the Dataset :

Our dataset contains a list of video games with sales greater than 100,000 copies. It was orginally generated by a scrape of vgchartz.com. For our analysis, the dataset was taken directly from Kaggle.

Fields include

  • Rank - Ranking of overall sales

  • Name - The games name

  • Platform - Platform of the games release (i.e. PC,PS4, etc.)

  • Year - Year of the game’s release

  • Genre - Genre of the game

  • Publisher - Publisher of the game

  • NA_Sales - Sales in North America (in millions)

  • EU_Sales - Sales in Europe (in millions)

  • JP_Sales - Sales in Japan (in millions)

  • Other_Sales - Sales in the rest of the world (in millions)

  • Global_Sales - Total worldwide sales.


Importing the dataset and the required libraries :

#install.packages("dplyr")
#install.packages("tidyverse")
#install.packages("gapminder")
#install.packages("GGally")
#install.packages("arm")
#install.packages("vcd")
#install.packages("devtools")
#install.packages("ade4")
#install.packages("rlang")
#install.packages("viridis")
#install.packages("plyr")     
library("plyr")
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr) 
library(stringr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange()   masks plyr::arrange()
## ✖ purrr::compact()   masks plyr::compact()
## ✖ dplyr::count()     masks plyr::count()
## ✖ dplyr::desc()      masks plyr::desc()
## ✖ tidyr::extract()   masks magrittr::extract()
## ✖ dplyr::failwith()  masks plyr::failwith()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::id()        masks plyr::id()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::mutate()    masks plyr::mutate()
## ✖ dplyr::rename()    masks plyr::rename()
## ✖ purrr::set_names() masks magrittr::set_names()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(broom)
library(viridis) 
## Loading required package: viridisLite
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(arm)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loading required package: lme4
## 
## arm (Version 1.13-1, built: 2022-8-25)
## 
## Working directory is /Users/rushank/My Drive/mac_gdrive/Exploratory Data Analysis/Project/5847734
library(vcd)
## Loading required package: grid
library(devtools)
## Loading required package: usethis
library(ade4)
library(rlang)
## 
## Attaching package: 'rlang'
## 
## The following objects are masked from 'package:purrr':
## 
##     %@%, flatten, flatten_chr, flatten_dbl, flatten_int, flatten_lgl,
##     flatten_raw, invoke, splice
## 
## The following object is masked from 'package:magrittr':
## 
##     set_names
#install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: scales
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:arm':
## 
##     rescale
## 
## The following object is masked from 'package:viridis':
## 
##     viridis_pal
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
#devtools::install_github("Ryo-N7/tvthemes")
invisible(theme_set(theme_bw()) + theme_update(plot.title = element_text(hjust = 0.5)))
number_ticks = function(n) {function(limits) pretty(limits, n)}
sales_data = read.csv('vgsales.csv')
dim(sales_data)
## [1] 16598    11
head(sales_data)
ABCDEFGHIJ0123456789
 
 
Rank
<int>
Name
<chr>
Platform
<chr>
Year
<chr>
Genre
<chr>
Publisher
<chr>
NA_Sales
<dbl>
EU_Sales
<dbl>
JP_Sales
<dbl>
11Wii SportsWii2006SportsNintendo41.4929.023.77
22Super Mario Bros.NES1985PlatformNintendo29.083.586.81
33Mario Kart WiiWii2008RacingNintendo15.8512.883.79
44Wii Sports ResortWii2009SportsNintendo15.7511.013.28
55Pokemon Red/Pokemon BlueGB1996Role-PlayingNintendo11.278.8910.22
66TetrisGB1989PuzzleNintendo23.202.264.22

Initial Analysis and Pre-processing

str(sales_data)
## 'data.frame':    16598 obs. of  11 variables:
##  $ Rank        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr  "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : chr  "Wii" "NES" "Wii" "Wii" ...
##  $ Year        : chr  "2006" "1985" "2008" "2009" ...
##  $ Genre       : chr  "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher   : chr  "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales    : num  41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num  29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num  3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num  8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num  82.7 40.2 35.8 33 31.4 ...
summary(sales_data)
##       Rank           Name             Platform             Year          
##  Min.   :    1   Length:16598       Length:16598       Length:16598      
##  1st Qu.: 4151   Class :character   Class :character   Class :character  
##  Median : 8300   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8301                                                           
##  3rd Qu.:12450                                                           
##  Max.   :16600                                                           
##     Genre            Publisher            NA_Sales          EU_Sales      
##  Length:16598       Length:16598       Min.   : 0.0000   Min.   : 0.0000  
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.0200  
##                                        Mean   : 0.2647   Mean   : 0.1467  
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.1100  
##                                        Max.   :41.4900   Max.   :29.0200  
##     JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.07778   Mean   : 0.04806   Mean   : 0.5374  
##  3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4700  
##  Max.   :10.22000   Max.   :10.57000   Max.   :82.7400
#converting year to a numeric column
sales_data$Year = as.numeric(sales_data$Year,na.rm = TRUE)
## Warning: NAs introduced by coercion
unique(sales_data$Year)
##  [1] 2006 1985 2008 2009 1996 1989 1984 2005 1999 2007 2010 2013 2004 1990 1988
## [16] 2002 2001 2011 1998 2015 2012 2014 1992 1997 1993 1994 1982 2003 1986 2000
## [31]   NA 1995 2016 1991 1981 1987 1980 1983 2020 2017
na_indices = is.na(sales_data)
na_values = sales_data[na_indices]
na_values
##   [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
##  [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [176] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [201] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [226] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [251] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# removing all the NA values
sales_data = na.omit(sales_data)
sales_data2 = data.frame(sales_data)
sales_data2[ , c("Platform", "Year", "Genre", "Publisher")] <- lapply(sales_data2[ , c("Platform", "Year", "Genre", "Publisher")], as.factor)
summary(sales_data2)
##       Rank           Name              Platform         Year     
##  Min.   :    1   Length:16327       DS     :2133   2009   :1431  
##  1st Qu.: 4136   Class :character   PS2    :2127   2008   :1428  
##  Median : 8295   Mode  :character   PS3    :1304   2010   :1259  
##  Mean   : 8293                      Wii    :1290   2007   :1202  
##  3rd Qu.:12442                      X360   :1235   2011   :1139  
##  Max.   :16600                      PSP    :1197   2006   :1008  
##                                     (Other):7041   (Other):8860  
##           Genre                             Publisher        NA_Sales      
##  Action      :3253   Electronic Arts             : 1339   Min.   : 0.0000  
##  Sports      :2304   Activision                  :  966   1st Qu.: 0.0000  
##  Misc        :1710   Namco Bandai Games          :  928   Median : 0.0800  
##  Role-Playing:1471   Ubisoft                     :  918   Mean   : 0.2654  
##  Shooter     :1282   Konami Digital Entertainment:  823   3rd Qu.: 0.2400  
##  Adventure   :1276   THQ                         :  712   Max.   :41.4900  
##  (Other)     :5031   (Other)                     :10641                    
##     EU_Sales          JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.0200   Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.1476   Mean   : 0.07866   Mean   : 0.04832   Mean   : 0.5402  
##  3rd Qu.: 0.1100   3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4800  
##  Max.   :29.0200   Max.   :10.22000   Max.   :10.57000   Max.   :82.7400  
## 

1. Univariate and Bivariate Analysis

A. Understanding the Distribution of Sales

ggplot(sales_data, aes(x=Global_Sales)) + geom_histogram(fill='navy', color='lightblue') + ggtitle('Global Sales - Histogram (bins=50)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(sales_data, aes(x=Global_Sales)) + geom_histogram(bins=100, fill='navy', color='lightblue') + ggtitle('Lowest Global Sales - Histogram (bins=100)') + xlim(0, 5)
## Warning: Removed 204 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

# log transformation on Global Sales
ggplot(sales_data) + geom_histogram(aes(log(Global_Sales)), fill='navy', color='lightblue')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Inference:
From the above histogram plots, we can see that the data is highly skewed and mostof the games have sales less than 1 million dollars, hence while fitting the models we should Log transform the column.

B. Understanding the Distribution of Year

# checking out the Year column
sales_data %>% 
  group_by(Year) %>% 
  count() %>% 
  ggplot() + 
  geom_density(aes(Year)) + scale_x_continuous(breaks=number_ticks(12)) + ggtitle('Global Sales by Year')

C. Understanding the Distribution of Platform

# Sales by platform :
sales_data %>% group_by(Platform) %>% 
  dplyr::summarise(sales = sum(Global_Sales)) %>% ggplot() + 
  geom_bar(aes(reorder(Platform, sales), sales), stat = "identity", 
           fill = "steelblue") + 
  xlab("Platform") + ylab("Global sales") + scale_y_continuous(breaks=number_ticks(5)) + ggtitle('Global Sales by Platform') +
  coord_flip() 

# combining the platforms together to get a better representation
sales_data = sales_data %>% 
  mutate(platform2 = case_when(
  Platform %in% c("Wii", "DS", "3DS", "WiiU", "GC", "GB","GBA") ~ "Nintendo",
  Platform %in% c("X360", "XB", "XOne") ~ "XBox",
  Platform %in% c("PS3", "PS4", "PS2", "PS", "PSP", "PSV") ~ "PS",
  Platform == "PC" ~ "PC",
  Platform == "DC" ~ "Sega"
))
sales_data %>% group_by(platform2) %>% 
  dplyr::summarise(sales = sum(Global_Sales)) %>% ggplot() + 
  geom_bar(aes(reorder(platform2, sales), sales), stat = "identity", 
           fill = "steelblue") + 
  xlab("Platform") + ylab("Global sales") + scale_y_continuous(breaks=number_ticks(5)) + ggtitle('Global Sales for Most Popular Platforms') +
  coord_flip()

Inference:
From the above barplots we can observe the top 3 most popular platforms(most sales over the years) for playing games are PS2, XBox 360 and PS3 respectively. But when we group by the manufacturer of the consoles and count the aggregate sales, we find that Play Station is the top performer while XBox has lowe sales than Nintendo’s consoles, and Sega has the lowest of all.

D. Exploring Genre : Which Genre has most of the sales?

ggplot(sales_data, aes(x=Genre, y=Global_Sales, fill=Genre, color=Genre)) +
  geom_bar(stat="identity") + 
  theme(axis.text.x=element_text(size=10, angle=90), legend.position = 'bottom', aspect.ratio=10/15, legend.key.size = unit(0.3, 'cm')) +
  ggtitle('Overall Sales by Genre')

D. Exploring Sales by Publisher : Which publisher has the highest Global Sales

# Count occurrences of each value in the Col1 column
value_counts <- table(sales_data$Publisher)

# Extract values with a count of 1
values_with_count_1 <- names(value_counts[value_counts > 411])


# Filter the dataframe based on values with count 1 in Col1 column
filtered_df <- sales_data %>%
  filter(Publisher %in% values_with_count_1)

ggplot(filtered_df, aes(x=Publisher, y=Global_Sales, fill=Publisher)) + theme(legend.position="bottom", axis.text.y=element_text(size=8, angle=45), legend.key.size = unit(0.3, 'cm')) +
  geom_bar(stat="identity") + ggtitle("Overall Sales by Publisher") + coord_flip()

Inference:
From the above two barplots, we can observe that the Genres Action, Shooter and Sports contribute higer to the total sales while lowest purchases games belongs to genres like Puzzle, Strategy and Adventure.
And Publishers like Nintendo, Electronic Arts and Activision seems to be the market leaders in terms of Total sales over all of the years.

E. Pair Plot

# Select only numeric columns
num_cols = sapply(sales_data, is.numeric)
df_num = sales_data[, num_cols]

# Create ggpairs for numeric columns
ggpairs(df_num, title='Pair Plot')

Inference:
As anticipated, [NA, EU, JP, Other]Sales are highly correlated with Global Sales and each other which suggests that if a newly launched game has comparatively higher sales than its competitors in one region, then launching the same game in other regions might result in comparatively similar trend probablistically.
Year is not significantly correlated with any other attribute

F. Understanding the relation of No of games released with Global Sales

color <- c("Titles released" = "red", "Global sales" = "blue")

sales_data %>% 
  group_by(Year) %>% 
  dplyr::summarise(sales = sum(Global_Sales), count = n())%>%
  
  ggplot() + geom_line(aes(Year, count, group = 1, color = "Titles released")) + 
  
  geom_line(aes(Year, sales, group = 1, color = "Global sales")) + 
  
  ylab("Titles released") +
  
  scale_color_manual(values = color) + labs(color = "") + scale_x_continuous(breaks=number_ticks(12)) + ggtitle('Titles Released and Sales by Year')

Inference:

SALES DATA

# Log transforming the sales column to check the relationship
ggplot(sales_data, aes(x = Year, y = log10(Global_Sales))) + geom_point()

colnames(sales_data)
##  [1] "Rank"         "Name"         "Platform"     "Year"         "Genre"       
##  [6] "Publisher"    "NA_Sales"     "EU_Sales"     "JP_Sales"     "Other_Sales" 
## [11] "Global_Sales" "platform2"
table(sales_data$Year)
## 
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 
##    9   46   36   17   14   14   21   16   15   17   16   41   43   60  121  219 
## 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 
##  263  289  379  338  349  482  829  775  763  941 1008 1202 1428 1431 1259 1139 
## 2012 2013 2014 2015 2016 2017 2020 
##  657  546  582  614  344    3    1
#sales_data[sales_data$Year > 2016, ]
# Since there is not enough data past 2016, we shall limit our analysis only till 2016
sales_data = sales_data[sales_data$Year < 2017, ]
sales_by_year = aggregate(cbind(NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales) ~ Year, data = sales_data, sum)

2. Multivariate Analysis :

Line-Plots of the sales columns :

# Area comparison
ggplot(sales_by_year, aes(x = Year)) +
  geom_line(aes(y = NA_Sales, color = "NA_Sales")) +
  geom_line(aes(y = EU_Sales, color = "EU_Sales")) +
  geom_line(aes(y = JP_Sales, color = "JP_Sales")) +
  geom_line(aes(y = Other_Sales, color = "Other_Sales")) +
  geom_line(aes(y = Global_Sales, color = "Global_Sales")) +
  labs(x = "Year", y = "Sales", color = "Sales Type") +
  ggtitle("Sales by Year") +
  scale_color_manual(values = c("NA_Sales" = "red", "EU_Sales" = "blue", 
                                 "JP_Sales" = "green", "Other_Sales" = "orange", 
                                 "Global_Sales" = "black")) + theme(legend.position = 'bottom', aspect.ratio=10/15) + scale_x_continuous(breaks=number_ticks(12))

# Reshape data to long format
sales_by_year_long <- reshape2::melt(sales_by_year, id.vars = "Year", variable.name = "Sales_Location", value.name = "Sales")
#head(sales_by_year_long)
sales_by_year_short <- sales_by_year_long[sales_by_year_long$Sales_Location != "Global_Sales", ]
ggplot(data = sales_by_year_short) +
  geom_line(aes(x = Year, y = Sales, colour = Sales_Location, group=Sales_Location)) + scale_x_continuous(breaks=number_ticks(12)) +
  ggtitle('Reginal Sales Over the Years')

sales_data %>% group_by(platform2, Year) %>%
  dplyr::summarise(sales = sum(Global_Sales)) %>% 
  ggplot() + 
  geom_line(aes(Year, sales, group = platform2, color = platform2)) +
  xlab("Year of release") + ylab("Global Sales") + labs(color = "") + ggtitle('Global Sales for each Platform')
## `summarise()` has grouped output by 'platform2'. You can override using the
## `.groups` argument.

# Top 3 best sellers overall
global_sales_sorted <- sales_data[order(-sales_data$Global_Sales), ]
top_global_sales <- global_sales_sorted[1:3, c("Name", "Platform", "Global_Sales")]

print(top_global_sales)
##                Name Platform Global_Sales
## 1        Wii Sports      Wii        82.74
## 2 Super Mario Bros.      NES        40.24
## 3    Mario Kart Wii      Wii        35.82

Understanding Game, Platform Popularity in North America

# NA Subset and sort
na_sales <- sales_data[!is.na(sales_data$NA_Sales), ]
na_sales_sorted <- na_sales[order(-na_sales$NA_Sales), ]

# top 10 with highest sales
top_na_sales <- na_sales_sorted[1:10, c("Name", "Platform", "NA_Sales")]


t_v_name_NA <- aggregate(list(NA_Sales = sales_data$NA_Sales), list(Name = sales_data$Name), sum)
t_v_name_NA <- t_v_name_NA[order(t_v_name_NA$NA_Sales, decreasing = T), ]

ggplot(data = top_na_sales, aes(x = Name, y = NA_Sales, fill = Name)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "NA Sales (in millions)", 
       title = "Top 10 Best Sellers in North America") + 
  theme(axis.text.y=element_text(size=7, angle=45), legend.position = 'none') + coord_flip()

# NA Subset and sort
na_sales <- sales_data[!is.na(sales_data$NA_Sales), ]
na_sales_sorted <- na_sales[order(-na_sales$NA_Sales), ]

# top 10 with highest sales
top_na_sales <- na_sales_sorted[1:10, c("Name", "Platform", "NA_Sales")]


t_v_name_NA <- aggregate(list(NA_Sales = sales_data$NA_Sales), list(Name = sales_data$Name), sum)
t_v_name_NA <- t_v_name_NA[order(t_v_name_NA$NA_Sales, decreasing = T), ]

ggplot(data = top_na_sales, aes(x = Name, y = NA_Sales, fill = Platform)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "NA Sales (in millions)", 
       title = "Top 10 Best Sellers in North America") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Understanding Game, Platform Popularity in Europe

# EU Subset and sort
eu_sales <- sales_data[!is.na(sales_data$EU_Sales), ]
eu_sales_sorted <- eu_sales[order(-eu_sales$EU_Sales), ]

# top 10 with highest sales
top_eu_sales <- eu_sales_sorted[1:10, c("Name", "Platform", "EU_Sales")]

ggplot(data = top_eu_sales, aes(x = Name, y = EU_Sales, fill = Platform)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "EU Sales (in millions)", 
       title = "Top 10 Best Sellers in Europe") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Understanding Game, Platform Popularity in Japan

# JP Subset and sort
jp_sales <- sales_data[!is.na(sales_data$JP_Sales), ]
jp_sales_sorted <- jp_sales[order(-jp_sales$JP_Sales), ]

# top 10 with highest sales
top_jp_sales <- jp_sales_sorted[1:10, c("Name", "Platform", "JP_Sales")]

ggplot(data = top_jp_sales, aes(x = Name, y = JP_Sales, fill = Platform)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "JP Sales (in millions)", 
       title = "Top 10 Best Sellers in Japan") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Understanding Game, Platform Popularity in Other Regions

# JP Subset and sort
o_sales <- sales_data[!is.na(sales_data$Other_Sales), ]
o_sales_sorted <- o_sales[order(-o_sales$Other_Sales), ]

# top 10 with highest sales
top_o_sales <- o_sales_sorted[1:10, c("Name", "Platform", "Other_Sales")]

ggplot(data = top_o_sales, aes(x = Name, y = Other_Sales, fill = Platform)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "Other Sales (in millions)", 
       title = "Top 10 Best Sellers in Other Regions") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Understanding Genre Popularity in North America

# NA Subset and sort
na_sales <- sales_data[!is.na(sales_data$NA_Sales), ]
na_sales_sorted <- na_sales[order(-na_sales$NA_Sales), ]

# top 10 with highest sales
top_na_sales <- na_sales_sorted[1:10, c("Name", "Platform", "NA_Sales", "Genre")]


t_v_name_NA <- aggregate(list(NA_Sales = sales_data$NA_Sales), list(Name = sales_data$Name), sum)
t_v_name_NA <- t_v_name_NA[order(t_v_name_NA$NA_Sales, decreasing = T), ]

ggplot(data = top_na_sales, aes(x = Name, y = NA_Sales, fill = Genre)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "NA Sales (in millions)", 
       title = "Top 10 Best Sellers in North America ~ Genre") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Understanding Genre Popularity in Europe

# EU Subset and sort
eu_sales <- sales_data[!is.na(sales_data$EU_Sales), ]
eu_sales_sorted <- eu_sales[order(-eu_sales$EU_Sales), ]

# top 10 with highest sales
top_eu_sales <- eu_sales_sorted[1:10, c("Name", "Platform", "EU_Sales", "Genre")]

ggplot(data = top_eu_sales, aes(x = Name, y = EU_Sales, fill = Genre)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "EU Sales (in millions)", 
       title = "Top 10 Best Sellers in Europe ~ Genre") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Understanding Genre Popularity in Japan

# JP Subset and sort
jp_sales <- sales_data[!is.na(sales_data$JP_Sales), ]
jp_sales_sorted <- jp_sales[order(-jp_sales$JP_Sales), ]

# top 10 with highest sales
top_jp_sales <- jp_sales_sorted[1:10, c("Name", "Platform", "JP_Sales", "Genre")]

ggplot(data = top_jp_sales, aes(x = Name, y = JP_Sales, fill = Genre)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "JP Sales (in millions)", 
       title = "Top 10 Best Sellers in Japan ~ Genre") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Understanding Genre Popularity in Other Regions

# JP Subset and sort
o_sales <- sales_data[!is.na(sales_data$Other_Sales), ]
o_sales_sorted <- o_sales[order(-o_sales$Other_Sales), ]

# top 10 with highest sales
top_o_sales <- o_sales_sorted[1:10, c("Name", "Platform", "Other_Sales", "Genre")]

ggplot(data = top_o_sales, aes(x = Name, y = Other_Sales, fill = Genre)) + 
  geom_bar(stat = "identity", color='white') +
  labs(x = "Game Title", y = "Other Sales (in millions)", 
       title = "Top 10 Best Sellers in Other Regions ~ Genre") + 
  theme(axis.text.y=element_text(size=8, angle=45), legend.position = 'bottom') + coord_flip()

Top Publisher for each year

years = unique(sales_data$Year)
df_top_p = data.frame(Year = integer(), 
                      Publisher = character(), 
                      Global_Sales = numeric(), 
                      stringsAsFactors = FALSE)

for (year in years) {
  # games released for current year
  year_sales <- subset(sales_data, Year == year)
  # Aggregate sales
  publisher_sales <- aggregate(Global_Sales ~ Publisher, year_sales, sum)
  
  # sory by descending and take the top
  top_publisher <- publisher_sales[order(-publisher_sales$Global_Sales), "Publisher"][1]
  # add into newdf
  df_top_p <- rbind(df_top_p, data.frame(Year = year, Publisher = top_publisher, Global_Sales = publisher_sales[publisher_sales$Publisher == top_publisher, "Global_Sales"]))
}

ggplot(data = df_top_p, aes(x = Year, y = Global_Sales, fill = Publisher)) +
  geom_col(position = "dodge") +
  labs(x = "Year", y = "Global Sales (in millions)", 
       title = "Top Publisher by Global Sales for Each Year") +
  theme(axis.text.x=element_text(size=8, angle=90), legend.position = 'bottom', legend.key.size = unit(0.2, 'cm')) + scale_x_continuous(breaks=number_ticks(21))

Inference:
As we can see from the above plot, Nintendo was the top Publisher for most of the years. How ever we can see a trend where Electronic Arts started selling more games than Nintendo from year 2002 until 2005 after which again Nintendo was top publisher. And lastly from the year 2010 Electronic Art’s sales were more than Nintendo.
Hence We are particularly interested in analyzing data from years 2005-2009 and 2010-2014 which will help us understand how Electronic Arts was able to beat it competitor Nintendo.

Further Analysis from year 2005-2009 [mid 5 years] and year 2010-2014[last 5 years]

#head(filtered_df)
mid5y = filtered_df %>% filter(Year>=2005, Year<=2009)
last5y = filtered_df %>% filter(Year>=2010, Year<=2014)
last5y
ABCDEFGHIJ0123456789
Rank
<int>
Name
<chr>
Platform
<chr>
Year
<dbl>
Genre
<chr>
17Grand Theft Auto VPS32013Action
24Grand Theft Auto VX3602013Action
27Pokemon Black/Pokemon WhiteDS2010Role-Playing
30Call of Duty: Modern Warfare 3X3602011Shooter
32Call of Duty: Black OpsX3602010Shooter
33Pokemon X/Pokemon Y3DS2013Role-Playing
35Call of Duty: Black Ops IIPS32012Shooter
36Call of Duty: Black Ops IIX3602012Shooter
38Call of Duty: Modern Warfare 3PS32011Shooter
41Call of Duty: Black OpsPS32010Shooter

Checking out the Sales for mid 5 years

ggplot(mid5y, aes(x=Year, y=Global_Sales))+
  geom_bar(stat='identity', fill="navy")+
  facet_wrap(~Publisher, nrow=2) + ggtitle('Sales from year 2005-2009 ~ Publisher') + theme(axis.text.x=element_text(size=9, angle=90))

ggplot(filter(mid5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo'), aes(x=Year, y=Global_Sales))+
  geom_bar(stat='identity', fill="navy")+
  facet_wrap(~Publisher, nrow=1) + ggtitle('Sales from 2005-2009 ~ Publisher') + theme(axis.text.x=element_text(size=8, angle=90))

ggplot(last5y, aes(x=Year, y=Global_Sales))+
  geom_bar(stat='identity', fill="navy")+
  facet_wrap(~Publisher, nrow=2) + ggtitle('Sales from year 2010-2014 ~ Publisher') + theme(axis.text.x=element_text(size=9, angle=90))

ggplot(filter(last5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo'), aes(x=Year, y=Global_Sales))+
  geom_bar(stat='identity', fill="navy")+
  facet_wrap(~Publisher, nrow=1) + ggtitle('Sales from 2010-2014 ~ Publisher') + theme(axis.text.x=element_text(size=8, angle=90))

Inference:
From the above two faceted plots, we can see that Nintendo, Electronic Arts and Acticision are having the best sales
In year 2005-2009 Nintendo sales were all time high about 200 million and averaged around 130 million over the period of 5 years, while Electronic Arts had average sales of 75 million followed by Activision with 60 million in sales
However, from year 2010 Nintendo’s sales dropped to average 55 million and Electronic Art’s average sales were highest around 60 million followed by Activision with average of 50 million.
In summary, we saw that Nintendo was having highest sales in mid-5-years while in the last-5-years Electronic Arts crossed Nintendo in average sales and Activision was also able to increase the sales comparatively.
We are interested in finding the probable reason why Nintendo’s sales dropped by around 50% while sales for Electronic Arts and Activision increased.

For number of game relased by publishers

count_last5y <- last5y %>%
  group_by(Publisher, Genre, Year) %>%
  dplyr::summarise(count = n())
## `summarise()` has grouped output by 'Publisher', 'Genre'. You can override
## using the `.groups` argument.
count_last5y
ABCDEFGHIJ0123456789
Publisher
<chr>
Genre
<chr>
Year
<dbl>
count
<int>
ActivisionAction201019
ActivisionAction201124
ActivisionAction201229
ActivisionAction20138
ActivisionAction201422
ActivisionAdventure20102
ActivisionAdventure20112
ActivisionAdventure20143
ActivisionMisc20108
ActivisionMisc201113
count_mid5y <- mid5y %>%
  group_by(Publisher, Genre, Year) %>%
  dplyr::summarise(count = n())
## `summarise()` has grouped output by 'Publisher', 'Genre'. You can override
## using the `.groups` argument.
#filter(count_last5y, Publisher=='Activision')[,2:4]
ggplot(data = count_mid5y, aes(x=Genre, y=count, fill=Publisher)) + 
    geom_bar(stat="identity") +
  facet_wrap('Year', nrow=1) + ylim(0,150) + 
  theme(axis.text.x=element_text(size=9, angle=90), legend.position = 'bottom', legend.key.size = unit(0.2, 'cm')) + ggtitle('Games Released for each Genre ~ mid-5-Years')

ggplot(data = count_last5y, aes(x=Genre, y=count, fill=Publisher)) + 
    geom_bar(stat="identity") +
  facet_wrap('Year', nrow=1) + ylim(0,150) + 
  theme(axis.text.x=element_text(size=9, angle=90), legend.position = 'bottom', legend.key.size = unit(0.3, 'cm')) + ggtitle('Games Released for each Genre ~ last-5-Years')

Inference:
The above plots are hard to interpret but we can spot a trend that the relatively higher number of games are realeased for Action category
To better understand similar kind of plots we will only consider publishers like “Nintendo”, “Activision” and “Electronic Arts” since we spotted a pattern for these publishers in previous plots for this set of years

Grouped Bar Plot for the top 3 publishers and all Genres

count_mid5y <- mid5y %>%
  group_by(Publisher, Genre, Year) %>%
  dplyr::summarise(count = n())
## `summarise()` has grouped output by 'Publisher', 'Genre'. You can override
## using the `.groups` argument.
count_mid5y
ABCDEFGHIJ0123456789
Publisher
<chr>
Genre
<chr>
Year
<dbl>
count
<int>
ActivisionAction200522
ActivisionAction20068
ActivisionAction200722
ActivisionAction200832
ActivisionAction200944
ActivisionAdventure20055
ActivisionAdventure20075
ActivisionAdventure20082
ActivisionAdventure20094
ActivisionMisc20054

Grouped Bar Plot for mid-5-years

ggplot(data = filter(count_mid5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo'), aes(x=Year, y=count, fill=Publisher, group=Publisher)) + 
    geom_bar(stat="identity", position="dodge") +
  facet_wrap('Genre', nrow=1) + 
theme(axis.text.x=element_text(size=8, angle=90), legend.key.size = unit(0.2, 'cm')) + ggtitle('Games released by Publishers(2005-2009) ~ Genre') + ylim(0,40)
## Warning: Removed 5 rows containing missing values (`geom_bar()`).

Stacked Bar Plots for top 3 Publishers for mid-5-years with Genres in which most of the games were released

m1 = filter(count_mid5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo')
m2 = filter(m1, Genre=='Action' | Genre=='Puzzle' | Genre=='Shooter' | Genre=='Sports' | Genre=='Racing' | Genre=='Misc')

ggplot(data = m2, aes(x=Year, y=count, fill=Publisher)) + 
    geom_bar(stat="identity") +
  facet_wrap('Genre', nrow=2) + 
  theme(axis.text.x=element_text(size=10, angle=90), legend.key.size = unit(0.2, 'cm'), legend.position = 'bottom') + ggtitle('Games released by Publishers(2005-2009) ~ Genre') + ylim(0,40)
## Warning: Removed 5 rows containing missing values (`position_stack()`).
## Warning: Removed 3 rows containing missing values (`geom_bar()`).

l1 = filter(count_last5y, Publisher=='Activision' | Publisher=='Electronic Arts' | Publisher=='Nintendo')
l2 = filter(l1, Genre=='Action' | Genre=='Puzzle' | Genre=='Shooter' | Genre=='Sports' | Genre=='Racing'| Genre=='Misc')

ggplot(data = l2, aes(x=Year, y=count, fill=Publisher)) + 
    geom_bar(stat="identity") +
  facet_wrap('Genre', nrow=2) + 
  theme(axis.text.x=element_text(size=10, angle=90), legend.key.size = unit(0.2, 'cm'), legend.position = 'bottom') + ggtitle('Games released by Publishers(2010-2014) ~ Genre') + ylim(0,40)
## Warning: Removed 3 rows containing missing values (`geom_bar()`).

Modelling:

sales_by_publisher = sales_data %>% group_by(Publisher) %>% 
  dplyr::summarise(sales = sum(Global_Sales)) %>% 
  arrange(desc(sales)) %>% slice(1:10)

top_publishers = head(sales_by_publisher$Publisher, n = 10)

filtered_df = sales_data %>%filter(Publisher %in% top_publishers)

sales_by_year_publisher = aggregate(Global_Sales ~ Publisher + Year, data = filtered_df, sum)

ggplot(sales_by_year_publisher, aes(x = Year, y = Global_Sales)) +
    geom_point(alpha=0.75,size=0.5) +
    geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) + 
  facet_wrap(~ Publisher, ncol = 5) +
  labs(title = "Trend of Total Global Sales by Publisher and Year",
       x = "Year",
       y = "Total Global Sales") +
      theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'

#geom_smooth(method = "lm",color= 'red', se =FALSE,linewidth= 0.8) +
sales_by_genre_year_publisher = aggregate(Global_Sales ~ Genre + Year, data = filtered_df, sum)
ggplot(sales_by_genre_year_publisher, aes(x = Year, y = Global_Sales)) +
  geom_point(alpha=0.75,size=0.5) +
    geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) +
  facet_wrap(~ Genre, ncol = 4) +
  labs(title = "Total Global Sales by Genre and Year",
       x = "Year",
       y = "Total Global Sales") +
  theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'

Modelling:

sales_by_publisher = sales_data %>% group_by(Publisher) %>% 
  dplyr::summarise(sales = sum(Global_Sales)) %>% 
  arrange(desc(sales)) %>% slice(1:10)

top_publishers = head(sales_by_publisher$Publisher, n = 10)

filtered_df = sales_data %>%filter(Publisher %in% top_publishers)

sales_by_year_publisher = aggregate(Global_Sales ~ Publisher + Year, data = filtered_df, sum)

ggplot(sales_by_year_publisher, aes(x = Year, y = Global_Sales)) +
    geom_point(alpha=0.75,size=0.5) +
    geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) + 
  geom_smooth(method = "lm",color= 'red', se =FALSE,linewidth= 0.8) +
  facet_wrap(~ Publisher, ncol = 5) +
  labs(title = "Trend of Total Global Sales by Publisher and Year",
       x = "Year",
       y = "Total Global Sales") +
      theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

sales_by_genre_year_publisher = aggregate(Global_Sales ~ Genre + Year, data = filtered_df, sum)
ggplot(sales_by_genre_year_publisher, aes(x = Year, y = Global_Sales)) +
  geom_point(alpha=0.75,size=0.5) +
    geom_smooth(method = "loess", se = FALSE, span=0.75, color= 'blue', linewidth= 0.6) + 
 geom_smooth(method = "lm",color= 'red', se =FALSE,linewidth= 0.8) +
  facet_wrap(~ Genre, ncol = 4) +
  labs(title = "Total Global Sales by Genre and Year",
       x = "Year",
       y = "Total Global Sales") +
  theme(axis.text.x=element_text(size=9))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Models :

Main aim:
The main aim behind our model is the check : - If the sales of the games in Japan influenced by sales of the games in any other part of the world. - How is does the sales in Japan contribute to the overall global sales of the games.

filtered_lo = loess(JP_Sales ~ NA_Sales * Global_Sales, data = filtered_df, span = 1/3, parametric = "NA_Sales", drop.square = "NA_Sales", family = "symmetric")

prediction_grid = data.frame(expand.grid(NA_Sales = seq(0.00, 42, by = 6.9), Global_Sales = seq(0.00, 83, by = 8)))

filtered_preds = augment(filtered_lo, newdata = prediction_grid)

    
ggplot(filtered_preds) +
    geom_line(aes(x = NA_Sales, y = .fitted)) +
    facet_wrap(~ Global_Sales, ncol = 11)  +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
    scale_y_continuous("Fitted value of JP_Sales")
## Warning: Removed 7 rows containing missing values (`geom_line()`).

  • The plot shows the relationship between NA_Sales and the predicted values of JP_Sales based on the fitted loess model, with facets for different values of Global_Sales. The plot provides insights into the nature of the relationship between the variables and how it varies with different values of Global_Sales
ggplot(filtered_preds) +
    geom_line(aes(x = Global_Sales, y = .fitted, 
                  color = factor(NA_Sales, levels = unique(NA_Sales), ordered = TRUE))) +
    guides(color = guide_legend(title = "NA_Sales")) +
    scale_y_continuous("Fitted value of JP_Sales")
## Warning: Removed 7 rows containing missing values (`geom_line()`).

  • Here we see that the fitted values of “JP_Sales” are changing systematically with respect to both “Global_Sales” and “NA_Sales”. The intersecting lines would suggest that the relationship between “JP_Sales” and “Global_Sales” is different at different levels of “NA_Sales”, and vice versa. This indicates that the relationship between “JP_Sales” and both “Global_Sales” and “NA_Sales” is moderated by the other variable.
filtered_lo = loess(JP_Sales ~  Global_Sales * NA_Sales, data = filtered_df, span = 1/3, parametric = "Global_Sales", drop.square = "Global_Sales", family = "symmetric")

prediction_grid = data.frame(expand.grid(Global_Sales = seq(0.00, 83, by = 6.9), NA_Sales = seq(0.00, 43, by = 8)))

filtered_preds = augment(filtered_lo, newdata = prediction_grid)

    
ggplot(filtered_preds) +
    geom_line(aes(x = Global_Sales, y = .fitted)) +
    facet_wrap(~ NA_Sales, ncol = 11)  +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
    scale_y_continuous("Fitted value of JP_Sales")+
    ggtitle('Interaction of Global_sales and JP_Sales faceted by NA_Sales')
## Warning: Removed 2 rows containing missing values (`geom_line()`).

  • Here we parallel lines, it means that there is no significant interaction between the “NA_Sales” variable and the “Global_Sales” variable in terms of their impact on the fitted values (“JP_Sales”) as shown in the plot.
ggplot(filtered_preds) +
    geom_line(aes(x = NA_Sales, y = .fitted, 
                  color = factor(Global_Sales, levels = unique(Global_Sales), ordered = TRUE))) +
    guides(color = guide_legend(title = "Global_Sales")) +
    scale_y_continuous("Fitted value of JP_Sales")+
    ggtitle('Global_sales given and NA_Sales faceted by NA_Sales')
## Warning: Removed 12 rows containing missing values (`geom_line()`).

  • Here, the fitted values of “JP_Sales” are not changing systematically with respect to “NA_Sales” for different levels of “Global_Sales”. The parallel lines would suggest that the relationship between “JP_Sales” and “NA_Sales” is consistent across different levels of “Global_Sales”. This may indicate that the “Global_Sales” variable does not have a significant effect on the relationship between “JP_Sales” and “NA_Sales”, and the relationship between “JP_Sales” and “NA_Sales” is not moderated by “Global_Sales”.

Predictive Models :

  • Let’s first augment the data with the user and critic ratings in order to see how they affect the global sales and regional sales :
sales_data = read.csv('vgsales_scores.csv')
sales_data
ABCDEFGHIJ0123456789
Name
<chr>
Platform
<chr>
Year_of_Release
<chr>
Genre
<chr>
Wii SportsWii2006Sports
Super Mario Bros.NES1985Platform
Mario Kart WiiWii2008Racing
Wii Sports ResortWii2009Sports
Pokemon Red/Pokemon BlueGB1996Role-Playing
TetrisGB1989Puzzle
New Super Mario Bros.DS2006Platform
Wii PlayWii2006Misc
New Super Mario Bros. WiiWii2009Platform
Duck HuntNES1984Shooter
sales_data$Year = as.numeric(sales_data$Year,na.rm = TRUE)
## Warning: NAs introduced by coercion
sales_data = na.omit(sales_data)
summary(sales_data$Critic_Score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   62.00   72.00   70.26   80.00   98.00
summary(sales_data$Critic_Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   14.00   24.00   28.84   39.00  113.00
summary(sales_data$User_Count)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    11.0    27.0   174.4    89.0 10665.0
summary(sales_data$User_Score)
##    Length     Class      Mode 
##      6894 character character
sales_data$User_Score = as.numeric(sales_data$User_Score)
summary(sales_data$User_Score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   6.500   7.500   7.184   8.200   9.600
  • Getting critic score the and user score to get them on the same scale :
sales_data$User_Score =  sales_data$User_Score * 10
colors = c("Critic score" = "blue", "User score" = "red")
ggplot(sales_data) + 
  
  # relating the critic_score to the global sales
  geom_smooth(aes(Critic_Score, Global_Sales, color = "Critic score")) + 
  # relating the user_score to the global sales
  geom_smooth(aes(User_Score, Global_Sales, color = "User score")) +
  labs(color = "") + xlab("Score") + ylab("Global sales") + 
  scale_color_manual(values = colors)+
  ggtitle('Global_sales given by User and Critic Score')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

- Here we see that the critic score is a more influential factor to the global sales.

colors = c("Critic score" = "blue", "User score" = "red")
ggplot(sales_data) + 
  
  # relating the critic_score to the global sales
  geom_smooth(aes(Critic_Score, JP_Sales, color = "Critic score")) + 
  # relating the user_score to the global sales
  geom_smooth(aes(User_Score, JP_Sales, color = "User score")) +
  labs(color = "") + xlab("Score") + ylab("JP sales") + 
  scale_color_manual(values = colors)+
  ggtitle('Japanese Sales given by User and Critic Score')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

  • Here we see that as compared to the other regions of the world Japanese sales are almost equally affected by the critic and the user ratings.

###Conclusions:

Overall we can draw alot of peculiar conclusions from our observations from our analysis and modeling:

  • Analysis conclusions:

  • Modeling conclusions :

    • Japan contributed consistently to the global sales over the years, and was least affected by the sales trends in the other regions.
    • Europe, North America and all the other regions followed very similar trends of game sales over the years.
    • The sales in all the other regions was highly affected by the ‘Critic-score’ values, however, in Japan. the sales were equally affected by the user as well as the crtic score.